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1 Introduction 



Spin chains, both in the classical and quantum version, are extremely im- 
portant tools to understand various physical situations and, in some cases, 
provide completely soluble models. An interesting field of applications of 
these models is the theory of molecular biological evolution. Since 1986, 
when Leuthdusser P put a correspondence between the Eigen model of evo- 
lution and a two-dimensional Ising model, many articles have been written 
representing biological systems as spin models. Recently in PJ it has been 
shown that the parallel mutation-selection model can be put in correspon- 
dence with the hamiltonian of an Ising quantum chain and in j3] the Eigen 
model has been mapped into the hamiltonian of one-dimensional quantum 
spin chains. In this approach the genetic sequence is specified by a sequence 
of spin values ±1. DNA is build up a sequence of a four basis or nucleotides 
which are usually identified by their letter: C, G, T, A (T being replaced 
by U in RNA), C and U (G and A) belonging to the purine family, denoted 
by R (respectively to the pyrimidine family, denoted by Y) . Therefore in the 
case of genome sequences each point in the sequence should be identified by 
an element of a four letter alphabet. As a simplification one identifies each 
element according to the purine or pyrimidine nature, reducing to a binary 
value, see jS] for a four-state quantum chain approach. As standard assump- 
tion, the strength of the mutation is assumed to depend from the Hamming 
distance between two sequences, i.e. the number of sites with different la- 
bels. Moreover usually it is assumed that the mutation matrix elements 
are vanishing for Hamming distances larger than 1, i.e. for more than one 
nucleotide changes. The main aim of the works using this approach, see 
El 13 El , is to find, in different landscapes, the mean "fitness" and the "bio- 
logical surplus" , in the framework of biological population evolution. At our 
knowledge these has been no attempt to apply spin models to obtain the ob- 
served equilibrium distribution of oligonucleotides in DNA. Martindale and 
Konopka [Hj, indeed, have remarked that the ranked short (ranging from 3 to 
10 nucleotides) oligonucleotide frequencies, in both coding and non-coding 
region of DNA, follow a Yule distribution. We recall that a Yule distribution 
is given by / = an^ 6", where n is the rank and a. A; < and h are 3 real 
parameters. In order to face this problem, in this paper we propose a spin 
model in which the intensity of the transition matrix depends in some way 
from the whole distribution of the nucleotides in the sequence. At present 
we assume that the transition matrix does not vanish only for total spin fiip 
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equal ±1, 



2 Mutations and Crystal basis 

A sequence of N ordered nucleotides, characterized only by the purine or 
pyrimidine character, that is a string of N binary labels or spin, can be 
represented as a vector belonging to the N-fold tensor product of the funda- 
mental irreducible representation (irrep.) (labelled by J = 1/2) of Uq^o{sl2) 
[TUj . which is usually called crystal basis representation. This parametriza- 
tion allows to represent, in a simple way, the mutation of a sequence as a 
linear transformation between vectors, which can be subjected to selection 
rules and whose strength depends from the two concerned states. So we 
identify a N-nucleotidic sequence as a vector 

|J)=| J3,A...,J\...,J') (1) 

where labels the irrep. which the vector belongs to, J3 is the value of 
the 3rd diagonal generator of Ug^oish) (2J3 = nR — ny, being the number 
of X elements in the sequence) and J* (2 < i < A^— 1) are A^— 2 labels needed 
to remove the degeneracy of the irreps. in the tensor product, in order to 
completely identify the state and which can be seen as identifying the irrep. 
which the sequence truncated to the i-th element belongs to. As an example, 
we consider a trinucleotidic string (A^ = 3) and label the eight different spin 
chains in the following way(| J3, J^, J^), R=\ =], Y = — | =|): 

Tii = l-|,io) tiT=liio) 

iti I 2'2''^^ TTi l2'2'"^^ 

I 2' 2'"^^ li-T I 2' 2'"^^ 

in = tTT=i|,|'i)- 

At this stage the crystal basis provides an alternative way of labelling any 
finite spin sequence, mapping any sequence in a vector state of an irrep., 
but we know that in physics and mathematics the choice of appropriate 
variables is of primary importance to face a problem. Indeed we argue that 
these variables are suitable to partially describe non local events which affect 
the mutations. Flipping the total spin by ±1 (AJ3 = ±1) can induce a 
transition to a vector belonging or not belonging to the irrep. of the original 
sequence. One can easily realize that to identify a nucleotidic sequence as 
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a vector of an irrep. requires to fix the number of RY "contracted couples" 
occurring in the considered sequence (contraction should be understood in 
the same sense of contraction of creation-annihilation operators in the Wick 
expansion). Therefore fiipping a spin implies or the creation or the deletion 
of a RY contracted couple, corresponding respectively to a variation of -1 o 
+1 of the value of the and, possibly, of some others J* (2 < z < A^— 1), or 
to leave unmodified the number of contracted couples (so that A = 0, but 
A J* 7^ for some values of i). Note that alternatively one can identify 1/2 = 
{C,G) and — 1/2 = {T,A). Below we give phenomenological arguments for 
our choice. 

3 Transition operators 

Let us consider a N-nucleotidic string and classify the different transitions 
on the string labels J3, J'^ , . . . , J^. We can distinguish different string con- 
figurations around the i-th position, so that a single nucleotide mutation in 
i-th position can correspond to different variations in the string labels. We 
call left (right) side free the nucleotides on the left (right) of i-th position 
and not contracted with another one on the same side. Let Ri be the initial 
(before mutation) number of the left side free purines and Yj, the initial num- 
ber of the right side free pyrimidines. We want to count the total number of 
contracted RY couples (before and after mutation) in the string, so we call 
Rin (Rfi) the number, in the initial (final) state, of R preceding some Y and 
not contracted with any Y on their side. In the same way, with Yin (^/i) 
we refer to the number of Y following some R and not contracted with any 
R on their side. If a R Y mutation (AJ3 = —1) occurs in i-th position, 
then Rin = Rfi + 1 and Fj„ = Yfi — 1, where Rin = Ri + 1 and Yin = Yr. We 
are interested in finding the stationary or equilibrium configuration of the 
2^ different possible sequence. Writing pj{t) the probability distribution at 
time t of the sequence identified by the vector | J), a decoupled version of 
selection mutation equation (see JT] for an exhaustive review), for a haploid 
organism, can be written as 

j^Pj{t) = Pj{t) (^Rj - J] /2k PK(t) j + Yl ^J.K PK{t) (2) 

where i^K is the Malthusian fitness of the sequence corresponding to the 
vector I K) and ; Mj k are the entries of a mutation matrix M which satisfies 
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Mj,j = -J2 ^J,K (3) 
The equation (0) is reduced to 

i^^it) = J2 (H + M),^^ XK(t) (4) 



dt 



where 



(5) 



and if is a diagonal matrix, with fitness as entries. 

In our model the mutation matrix is written as the sum of the following 
partial mutation matrices (T means transposition): 

• If Ri = Yr we distinguish two subcases: 

1. Ri = Yr^O 

N-1 N 

^^ = 1111 "i' (^^.^'^- + J+^lk) (6) 

i=2 k=i+l 

2. Ri = Yr = Q 

M2 = «2 {J- + J+) (7) 

• If Ri > Yr, we distinguish two subcases: 



1. 1; = 



2. 1; 7^ 



N 



M, = Y,(^iiAJ- + J+Aj) (8) 



1=2 



N-l 



M^ = Y,< (AJ- + J+Aj) (9) 



1=3 



If Ri < Yr, we distinguish two subcases: 
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1. Ri = 



N 

M, = J2a^{J.Al + A^J^) (10) 

m=2 

2. Rij^ {2 < N - 2;i + 1 < k < N - 1). 

M6 = Y.Y. + ^ik^k+M (11) 

i=2 k=i+l 

where J+ and J_ are the step operators (to save space we do not write their 
exphcit form) defined by Kashiwara PU]; acting on the states of an irrep. 
with highest weight , i.e. inducing a mutation AJj = 0, Wi N, 

j±i j)=i J3±l,J^,..,J^..,J2) (12) 

and 

^i,fc I J) = \ J?,-, 1 1 ^ — 1, .., J* — 1, J* ^,..,J'^) 

{2<i<N-l i+l<k<N) (13) 

A,\J) = I J3,J^-1,...,J*-1,J*-\...,J2) 

(2 < i < A^) (14) 

Therefore Ajg. is the operator which connects vectors differing by +1 in the 
value of J\ for k — 1 < I < i. A few words to comment on the above 
equations. Let us consider a mutation R ^ Y, which involves a transition 
AJ]\r = —1 (case Ri > Yr) and entails AJ3 = —1, we have to apply the 
operator J_, as well as the operator Ai. Of course, first we have to lower by 1 
the value of J3, then to modify J^, otherwise the initial state may eventually 
be annihilated, even if the transition is allowed (in the case — 1 < J3). 
Likewise, in corrispondence of a transition Y —>■ R (AJ3 = +1), first the 
change —>■ + 1 has to be performed, then J3 — > J3 + 1. Clearly in 
eq.©, assuming equal rates for mutations and for back mutations, we have 
to sum the operator, which gives rise to the transition Y ^ R with that one 
which leads to R ^ Y, that is 

AJ^ + J+Aj (15) 
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This operator leads to the mutation Y ^ R or R Y for a nucleotide in 
i-th position, in a string with Ri > Yj.. If the mutation R ^ Y corresponds 
to a rising of (i.e. a transition with AJn = 1, A J3 = —1, case Ri < Yr), 
first has to be modified, then J3; therefore we write 

J.Al + AmJ+ (16) 

The above operator gives rise to mutations R ^ Y and Y ^ R for a nu- 
cleotide in i-th position, preceding the m-th one, in the case Ri = 0, 7^ 0. 
Let us remark that eq.© is included in eq.®, if the coupling constants 
a are assumed equal; in eq. (fTT|) . only the writing order for Ak+i (and its 
transposed) and J± has to be respected. Assuming now that the coupling 
constants do not depend on i, k, m, we can write the mutation matrix M as 

M = /ii (M3 + M5) + + /isMs + /i4M6 + Md (17) 

where is the diagonal part of the mutation matrix defined by eq.Q. 
The scale of the values of the coupling constants of M is suggested by the 
phenomenogy. We want to write an interaction term which makes the mu- 
tation in alternating purinic/pyrimidinic tracts less likely than polipurinic 
or polipyrimidinic ones. We mean as a single nucleotide mutation in a 
polipurinic (polipyrimidinic) tract, a mutation inside a string with all nu- 
cleotides R (Y), i.e. a highest (lowest) weight state. Such a transition corre- 
sponds to the selection rules AJ^ = — 1,A J3 = ±1, i.e. a transition generated 
by the action of M3 and M5. In the mutation matrix M, we give them a 
coupling constant smaller than the We introduce, for AJ3 = ±1 , only four 
different mutation parameters /i, {i = 1,2,3,4), with fii < fik k > 1. 

1. fii for mutations which change the irrep., AJ^ = ±1 , and include the 
spin flip inside a highest or lowest weight vector; 

2. fi2 for mutations which do not change the irrep., AJ^ = , but modify 
other values of J^ AJ'' = ±1 {2 < k < N - 1); 

3. /i3 for mutations which do neither change the irrep., AJ^ = 0, nor 
the other values of J'', AJ^ = 0, (2 < /c < - 1); 

4. /i4 for mutations which change the irrep., A J = ±1 , but only in a 
string with Q ^ Ri <Yr. 

We do not introduce another parameter, for mutations generated by M4, i.e. 
i-ih nucleotide mutation in a string with Ri >Yj. ^ 0, to not distinguish, in 
a polipurinic string, a mutation according to its position. 
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4 Results 



The evolution equation of the model for the probabilities will be written in 
terms of the matrix H = H + M + XI, where the fitness can he H = J3 
(purely additive fitness) and A is choosen in such a way to guarantee H is 
positive. Being H+M irreducible, the composition of equilibrium population 
is given by 

P3 = (18) 



where xj is the Perron- Frobenius eigenvector jlzj of H. We look for a nu- 
merical solution, with a suitable choice of the value of the parameters, for 
N = 3,4,6. Before solving numerically the model, we point out explicitly its 
main features. M describes an interaction on the i-th spin neither depending 
on the position nor on the nature of the closest neighbours, but which takes 
into account, at least partially, the effects on the transition in the i-th site of 
the distribution of all the spins, that is non local effects. Indeed it depends 
on the "ordered" spin orientation surplus on the left and on the right of the 
i-th position. Should it not depend on the order, it may be considered as 
a mean-field like effect. Moreover AJ3 = ±1 transitions are allowed, which, 
e.g. for N = 4, can be considered or as the flip of a spin combined with 
an exchange of the two, oppositely oriented, previous or following spins or 
as the collective flip of particular three spin systems, containing a two spin 
system with opposite spin orientations (see example below). Biologically, 
the transition depends in some way on the "ordered" purine surplus on the 
left and on the right of the mutant position. Let us briefly comment on the 
physical-biological meaning of the "ordered" spin sequence. Our aim is to 
study finite oligonucleotide sequence in which a beginning and an end are 
defined. This implies we can neither make a thermodynamic limit on nor 
define periodic conditions on the spin chain. So we have to take into account 
the "edge" or "boundary" conditions on the finite sequence. An analogous 
problem appears in determing thermodynamic properties of short oligomers 
and, in this framework, in ^3] the concept of fictitious nucleotide pairs E 
and E' has been introduced, in order to mimick the edge effects. The or- 
dered couple of RY takes into account in some way the different interactions 
of R and Y with the edges. For example, the transition matrix, on the above 
basis (for N = 3) is the following one, up to a multiplicative dimensional 



7 



factor /io 

/x507eOe 

(5 X e 

X 6 e e 

705xOeO 

e e a; 5 

e e 5 X 5 

e e 5 X 
\ e e 5 

where the diagonal entries are not exphcitly written, and are given by Q. 
Note that the above matrix depends only on three coupling constants due 
to the very short length of the chain. For > 4 the 4th coupling constant 
(denoted in the following by 77) will appear. Let us emphasize that the 
mutation matrix M (fTU|) does not only connect states at unitary Hamming 
distance. As an example, we write explicitly the transitions from | i,i,0) 
(nT)andfrom|-i,i,0) (Tii) 

r TTT 

TiT^< lit Tii 
( Tii 

For lack of space, we do not explicitly write the mutation matrix, which 
allows transitions only between chains at Hamming distance equal to one, 
with coupling constant (3. Notice howevere that such a (Hamming) matrix is 
not obtained by eq. lfT^ putting 6 = •y = e = (3. If we order (in a decreasing 
way) the equilibrium probabilities, we obtain, using the mutation matrix with 
Hamming distance, a rank ordered distribution of transition probability like 
that in fig^for = 4. Its shape does not depend on the value of (3. The 
rank-ordered distribution of the probabilities shows a plateaux structure: 
every plateaux contains spin sequences at the same Hamming distance from 
the sequence with the highest value of the fitness. Using the mutation matrix 
the rank ordered probabilities distribution does not show a plateaux 
structure, but its shape is well fitted by a Yule distribution (figH)), like the 
observed frequency distribution of oligonucleotidic in the strings of nucleic 
acids 0. Let us observe that we obtain a Yule distribution (and not a 
plateaux structure) even if all parameters in (fT^ are tuned at the same 
value, which means that the distribution is the outcome of the model and 
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(19) 
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not of the choice of the values of the couphng constants. Analogous resultes 
are obtained for = 6 (figEI)- Let us point out that: 

i) our model is not equivalent to a model where the intensity depends on 
the site undergoing the transition, or from the nature of the closest neigh- 
bours or the number of the R and Y labels of the sequence; indeed essentially 
the intensity depends on distribution in the sequence of the R and Y; 

ii) the ranked distribution of the probabilities follows a Yule distribution 
law, but as the value of the parameter b is close to the unity, the distribution 
is equally well fitted by a Zipf law (/ = an^), in agreement with the remark 
of p. 

In conclusion we are far from claiming that our simple model is the only 
model able to explain the observed oligonucleotides distribution, for several 
obvious reasons, but that the standard approach using the Hamming distance 
does not give a Yule or Zipf distribution. One may correctly argue that the 
comparison between the Hamming model, depending on only one parameter 
and taking into account only one site spin flip, with our model, which depends 
on four parameters and takes into account spin flip of more than one site, 
is not meaningful. So we have computed the stationary distribution with a 
mutation matrix not vanishing for Hamming distance larger than one and 
allowing the same number of mutations as our model. The result reported 
in figlU shows that the plateaux structure is always the dominant feature. 
Let us comment on the non point mutations which naturally are present in 
our model. In literature there is an increasing number of paper that, on the 
basis of more accurate data, question both the assumptions that mutations 
occur as single nucleotide and as independent point event. In a quite recent 
paper Whelan and Goldman J3] have presented a model allowing for single- 
nucleotide, doublet and triplet mutation, finding that the model provides 
statistically significant improvements in fits with protein coding sequences. 
We note that the triplet mutations, for which there is no known inducing 
mechanism, but which can possibly be explained by large scale event, called 
sequence inversion in jHj , are indeed the kind of mutations, above discussed, 
that our model naturally describes. Doublet mutations do not appear, due to 
the assumed spin flip equal ±1, but on one side some of these mutations are 
hidden by the binary approximation, and on the other side the parameter 
ruling such mutations, as computed in is lower than the one ruling 
the triplet mutation. In conclusion the Hamming distance does not seem a 
suitable measure of the distance in the space of the biological sequences, the 
crystal basis, on the contrary, seems a better candidate to parametrize the 
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elements of such space. Our model makes use of this parametrisation, allows 
to modelise some non point mutations and exhibits intriguing and interesting 
features, hinting in the right direction, worthwhile to be further investigated. 
In the present simple version, the model depends only on 4 parameters for 
any N, which are, very likely, not enough to describe sequences longer that 
the considered ones. However the model is rather flexible and, besides the 
obvious introduction of more coupling constants, allows, e.g., to analyse part 
of the sequences containing hot spots in the mutation (using fictitious edge 
nucleotides), to take into account doublet mutations (indeed the operator 
eq.lfT^ or Af^_^^ describes a doublet spin flip at position i,i+l) and an easy 
generalisation to four letter alphabet. Although the very short chain, which 
we were interested in, can be studied numerically without any use of the 
crystal basis, we propose a general algorithm, which can be applied to chains 
of arbitrary length and which can be easily implemented in computers. 
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Figure 1: Rank ordered distribution of equilibrium population (N=4) ob- 
tained for an Hamming transition matrix, with /3 = 0.60. 
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Figure 2: Rank ordered distribution of equilibrium population (N=4) for a 
transition matrix M with e = 0.25, 7 = 5 = = 0.50. The distribution was 
fitted by a Yule function (continuous line) / = aR%^ {R is the rank). The 
parameters was estimated as a = 0.37, b = 1.02, k = —1.28. 
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Figure 3: Rank ordered distribution of equilibrium population (N=6) for a 
transition matrix M, with e = 0.25, 'j = 6 = rj = 0.5O. The distribution 
was fitted by a Yule function (continuous line) / = aR'^b^ {R is the rank). 
The parameters was estimated as a = 0.26, b = 1.00 k = —1.11. 
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Figure 4: Rank ordered distribution of equilibrium population (N=4) ob- 
tained for a transition matrix allowing the same number of mutations as M, 
between sequences at different Hamming distances 
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